【转】如何用Spark来实现已有的MapReduce程序
http://rw.baidu.com/forum.php?mod=viewthread&tid=132878
MapReduce从出现以来,已经成为Apache Hadoop计算范式的扛鼎之作。它对于符合其设计的各项工作堪称完美:大规模日志处理,ETL批处理操作等。
随着Hadoop使用范围的不断扩大,人们已经清楚知道MapReduce不是所有计算的最佳框架。Hadoop 2将资源管理器YARN作为自己的顶级组件,为其他计算引擎的接入提供了可能性。如Impala等非MapReduce架构的引入,使平台具备了支持交互式SQL的能力。
今天,Apache Spark是另一种这样的替代,并且被称为是超越MapReduce的通用计算范例。也许您会好奇:MapReduce一直以来已经这么有用了,怎么能突然被取代?毕竟,还有很多ETL这样的工作需要在Hadoop上进行,即使该平台目前也已经拥有其他实时功能。
值得庆幸的是,在Spark上重新实现MapReduce一样的计算是完全可能的。它们可以被更简单的维护,而且在某些情况下更快速,这要归功于Spark优化了刷写数据到磁盘的过程。Spark重新实现MapReduce编程范式不过是回归本源。Spark模仿了Scala的函数式编程风格和API。而MapReduce的想法来自于函数式编程语言LISP。
尽管Spark的主要抽象是RDD(弹性分布式数据集),实现了Map,reduce等操作,但这些都不是Hadoop的Mapper或Reducer API的直接模拟。这些转变也往往成为开发者从Mapper和Reducer类平行迁移到Spark的绊脚石。
与Scala或Spark中经典函数语言实现的map和reduce函数相比,原有Hadoop提供的Mapper和Reducer API 更灵活也更复杂。这些区别对于习惯了MapReduce的开发者而言也许并不明显,下列行为是针对Hadoop的实现而不是MapReduce的抽象概念:
· Mapper和Reducer总是使用键值对作为输入输出。
· 每个Reducer按照Key对Value进行reduce。
· 每个Mapper和Reducer对于每组输入可能产生0个,1个或多个键值对。
· Mapper和Reducer可能产生任意的keys和values,而不局限于输入的子集和变换。
Mapper和Reducer对象的生命周期可能横跨多个map和reduce操作。它们支持setup和cleanup方法,在批量记录处理开始之前和结束之后被调用。
本文将简要展示怎样在Spark中重现以上过程,您将发现不需要逐字翻译Mapper和Reducer!
public class LineLengthMapper extends
Mapper<LongWritable, Text, IntWritable, IntWritable> {
@Override
protected void map(LongWritable lineNumber, Text line, Context context)
throws IOException, InterruptedException {
context.write(new IntWritable(line.getLength()), new IntWritable(1));
}
}
|
lines.map(line => (line.length, 1))
|
public class LineLengthReducer extends
Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
@Override
protected void reduce(IntWritable length, Iterable<IntWritable> counts,
Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable count : counts) {
sum += count.get();
}
context.write(length, new IntWritable(sum));
}
}
|
val lengthCounts = lines.map(line => (line.length, 1)).reduceByKey(_ + _)
|
public class CountUppercaseMapper extends
Mapper<LongWritable, Text, Text, IntWritable> {
@Override
protected void map(LongWritable lineNumber, Text line, Context context)
throws IOException, InterruptedException {
for (String word : line.toString().split(" ")) {
if (Character.isUpperCase(word.charAt(0))) {
context.write(new Text(word), new IntWritable(1));
}
}
}
}
|
lines.flatMap(
_.split(" ").filter(word => Character.isUpperCase(word(0))).map(word => (word,1))
)
|
public class CountUppercaseReducer extends
Reducer<Text, IntWritable, Text, IntWritable> {
@Override
protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable count : counts) {
sum += count.get();
}
context
.write(new Text(word.toString().toUpperCase()), new IntWritable(sum));
}
}
|
groupByKey().map { case (word,ones) => (word.toUpperCase, ones.sum) }
|
public class SetupCleanupMapper extends
Mapper<LongWritable, Text, Text, IntWritable> {
private Connection dbConnection;
@Override
protected void setup(Context context) {
dbConnection = ...;
}
...
@Override
protected void cleanup(Context context) {
dbConnection.close();
}
}
|
val dbConnection = ...
lines.map(... dbConnection.createStatement(...) ...)
dbConnection.close() // Wrong!
|
lines.mapPartitions { valueIterator =>
val dbConnection = ... // OK
val transformedIterator = valueIterator.map(... dbConnection ...)
dbConnection.close() // Still wrong! May not have evaluated iterator
transformedIterator
}
|
lines.mapPartitions { valueIterator =>
if (valueIterator.isEmpty) {
Iterator[...]()
} else {
val dbConnection = ...
valueIterator.map { item =>
val transformedItem = ...
if (!valueIterator.hasNext) {
dbConnection.close()
}
transformedItem
}
}
}
|